Hapax - Enriching Reverse Engineering with Semantic Clustering

Author

  • Adrian Kuhn
Abstract

Many reverse engineering approaches focus on structural information and ignore semantic information such as identifier names and comments. Yet these are exactly the parts of the source code where developers record their domain knowledge, and without them one cannot tell what the code means. We use Latent Semantic Indexing (LSI), an information retrieval technique [3], to compute the semantic similarity between entities at different levels (e.g. whole systems, classes and methods), and then we cluster these entities according to their similarity. We employ this technique to characterize entities by clustering their sub-entities [2]. For example, we cluster classes to characterize the system. Furthermore, we use the same technique to recover the most relevant labels for the clusters. We implemented this approach in a tool called Hapax, which is built on top of the Moose reengineering environment [1].

Figure 1 emphasizes the interactive nature of our tool. The top of the figure shows the main window of Hapax. The left part of the window holds the correlation matrix visualization: the darker the dot, the more similar the two entities. The top two panels on the right show the entities on the current row and column. The bottom-right panel shows the labels attached to the current cluster. On the right side of the window there is a slider for setting the clustering threshold. When the slider is moved, the picture is redrawn with the new clusters. The bottom of the figure shows how we also use LSI to search over the entities in the system. The upper window contains the search query, and the window below it shows the matching entities ordered by their relevance to the query.
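The pipeline described in the abstract (LSI, similarity, threshold clustering, labeling) can be sketched roughly as follows. This is a minimal illustration, not Hapax's implementation: the four class vocabularies in `ENTITIES` are made up, the raw-count weighting, the rank of 2, the chain-linking clustering rule and the centroid-projection labeling are all assumptions chosen to keep the example small, and NumPy stands in for whatever Hapax uses internally.

```python
from collections import Counter

import numpy as np

# Hypothetical toy input: identifier and comment terms of four made-up classes.
ENTITIES = {
    "AccountManager": "account balance deposit withdraw account balance",
    "LoanCalculator": "loan interest account balance rate",
    "ShapeRenderer": "draw shape canvas pixel color",
    "CanvasWidget": "canvas pixel draw widget color",
}


def term_document_matrix(entities):
    """Rows are terms, columns are entities, cells are raw term counts."""
    vocab = sorted({t for text in entities.values() for t in text.split()})
    row = {t: i for i, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(entities)))
    for col, text in enumerate(entities.values()):
        for term, count in Counter(text.split()).items():
            matrix[row[term], col] = count
    return vocab, matrix


def lsi(matrix, rank):
    """Truncated SVD; returns term and entity vectors in the reduced space."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank].T * s[:rank]


def cosine_similarity(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0.0, 1.0, norms)
    return unit @ unit.T


def cluster(similarity, threshold):
    """Group entities connected by a chain of similarities >= threshold."""
    unassigned = set(range(len(similarity)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        members, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [j for j in unassigned if similarity[current][j] >= threshold]
            for j in linked:
                unassigned.remove(j)
            members.extend(linked)
            frontier.extend(linked)
        clusters.append(sorted(members))
    return clusters


def labels(term_vecs, entity_vecs, members, vocab, top_n):
    """Rank terms by their projection onto the cluster centroid (an assumption)."""
    centroid = entity_vecs[members].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = term_vecs @ centroid
    best = np.argsort(scores)[::-1][:top_n]
    return [vocab[i] for i in best]


vocab, matrix = term_document_matrix(ENTITIES)
term_vecs, entity_vecs = lsi(matrix, rank=2)
similarity = cosine_similarity(entity_vecs)
names = list(ENTITIES)
clusters = cluster(similarity, threshold=0.5)
for members in clusters:
    print([names[i] for i in members], labels(term_vecs, entity_vecs, members, vocab, 2))
```

On this toy input the two banking classes and the two drawing classes land in separate clusters. Raising `threshold` plays the role of the slider: at a value above 1 every entity becomes its own cluster. The `similarity` matrix is the data a correlation-matrix view like the one described above would draw.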

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Centralized Clustering Method To Increase Accuracy In Ontology Matching Systems

Ontology is the main infrastructure of the Semantic Web, providing facilities for integration, searching and sharing of information on the web. The development of ontologies as the basis of the Semantic Web, together with their heterogeneity, has led to the emergence of ontology matching. With the emergence of large-scale ontologies in real domains, ontology matching systems have faced problems such as memory con...


SUPPLEMENTARY NOTES for Origin of co-expression patterns in E.coli and S.cerevisiae emerging from reverse engineering algorithms

Contents: 1 Methods and data; 1.1 Data collected; 1.2 Algorithms; 1.3 Overrepresented networks; 1.3.1 Statistical analysis ...


Reverse Engineering of Network Software Binary Codes for Identification of Syntax and Semantics of Protocol Messages

Reverse engineering of network applications, especially from the security point of view, is of high importance and interest. Many network applications use proprietary protocols whose specifications are not publicly available. Reverse engineering of such applications could provide us with vital information to understand their embedded unknown protocols. This could facilitate many tasks including d...


The Impact of Semantic Clustering on Iranian EFL Advanced Learners’ Vocabulary Retention

This study investigated the impact of semantic clustering on Iranian EFL learners’ vocabulary retention at advanced level. Participants were female learners randomly assigned to two groups of 15. Four instruments (TOEFL test; vocabulary pretest; immediate posttest, and delayed recall posttest) were used. The experimental group underwent semantic clustering vocabulary presentation in which the l...


Publication date: 2005